ABSTRACT


Entity Linking (EL) aims to automatically link the mentions in unstructured documents to the corresponding
entities in a knowledge base (KB). The task has recently been dominated by global models. Although many
global EL methods attempt to model the topical coherence among all linked entities, most of them fail to
exploit the correlations among the manifold knowledge that is helpful for linking, such as the semantics of
mentions and their candidates, the neighborhood information of candidate entities in the KB and the
fine-grained type information of entities. As we will show in the paper, interactions among these types of
information are very useful for better characterizing the topic features of entities and for more accurately
estimating the topical coherence among all the referred entities within the same document. In this paper, we
present a novel HEterogeneous Graph-based Entity Linker (HEGEL) for global entity linking, which builds an
informative heterogeneous graph for every document to collect various linking clues. HEGEL then applies a
novel heterogeneous graph neural network (HGNN) to integrate the different types of manifold information and
model the interactions among them. Experiments on the standard benchmark datasets demonstrate that
HEGEL captures the global coherence well and outperforms the prior state-of-the-art EL methods.


* Corresponding author: Yuting Wu (Email: wyting@pku.edu.cn; ORCID:0000-0002-7550-3804). 
1. INTRODUCTION


Entity Linking (EL) is the task of mapping entity mentions with a specified context in an unstructured
document to the corresponding entities in a given Knowledge Base (KB). It bridges the gap between the
abundant unstructured text in large corpora and structured knowledge sources, and therefore supports many
knowledge-driven natural language processing (NLP) tasks and their methods, such as question answering
[1], text classification [2], information extraction [3] and knowledge graph construction [4].


Recently, the EL task has been dominated by global methods [5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15], which
model the topical coherence among the linked entities of the mentions in the same document. Global
information relies on the semantic and topical coherence of the entities related to the various mentions in
the same document, and most state-of-the-art models integrate it with local mention-context information
to alleviate the biases of the local context. For instance, as shown in Figure 1, when linking
the mention “England”, it is difficult to decide between the candidate entities England national football
team and England national rugby union team using only the surrounding sports-related local context,
which consists of match scores and a stadium name. Such context is noisy and may lead the
linker to the more popular but wrong candidate England national football team. However, if an
EL model can capture the common topic “rugby” shared by the mentions
“Scotland”, “Murrayfield”, “Cuttitta” and “England” in the current paragraph, for example by taking into
account that the nearby mention “Cuttitta” is linked to the candidate Marcello Cuttitta, a former Italian
rugby union player, then the model can correctly link the mention “England” to the candidate England national
rugby union team.


[Figure 1 (image): a news paragraph about Italy recalling Marcello Cuttitta for a friendly against Scotland at
Murrayfield, in which the mentions “Scotland”, “Murrayfield”, “Cuttitta” and “England” are annotated with
candidate entities and their topics: Scotland national rugby union team (rugby union team), Scotland (country),
Murrayfield Stadium (sport stadium), Marcello Cuttitta (rugby union player), England national rugby union team
(rugby union team) and England national football team (football team).]


Figure 1. The illustration example. By considering the topical coherence, an EL model can accurately link the 
mentions “Scotland”, “Murrayfield”, “Cuttitta” and “England” to their corresponding entities (in bold) that share 
the common topic “rugby”. 


Although prior global EL approaches have greatly boosted the performance of local models, most of them
do not simultaneously consider multiple types of useful information and the interactions among them when
modeling the global coherence. Such information includes the semantics of mentions and their candidates,
the neighborhood information of candidate entities in the KB and the fine-grained type information of
entities. As a result, these approaches fail to precisely estimate the coherence among the referred entities.
As we will show in the paper, effectively modeling the interactions among the manifold linking knowledge
helps to better model the topical coherence and achieve more accurate EL.


Most recently, some global methods [14, 16] construct a document-level graph with the candidate entities
of the mentions as nodes and apply Graph Convolutional Networks (GCN) [17] to the graph to integrate
the global information, delivering promising results. Inspired by the effectiveness of using GCN to model
the global signal, we present the HEterogeneous Graph-based Entity Linker (HEGEL), a novel global EL
framework designed to model the interactions among manifold heterogeneous information from different
sources by constructing an informative document-level heterogeneous graph and applying a heterogeneous
architecture in the GNN aggregation operation. We first construct an informative document-level
heterogeneous graph with mentions, candidate entities, neighbors of entities and extracted keywords as
nodes, and we create different types of edges to link these different types of nodes. We then apply a
carefully designed heterogeneous graph neural network (HGNN) to the constructed graph to encode
the global coherence, which allows information to propagate along the informative graph structure and
encourages sufficient interactions among the different types of information. Combined with the traditional
score combination and ranking procedure, our model can be trained to use this information in an end-to-end
fashion.


Our contributions can be summarized as follows: 


• We designed a novel approach to construct an informative document-level heterogeneous graph that
collects manifold linking knowledge from different sources to support the linking process.

• We proposed a carefully designed heterogeneous graph neural network on the constructed graph,
which integrates different sources of information and encourages sufficient interactions among them,
more precisely characterizing the topic features of candidate entities and better capturing the topical
coherence. To the best of our knowledge, this is the first work to employ a heterogeneous graph neural
network in Entity Linking tasks.

• Extensive experiments and analysis on six standard EL datasets demonstrate that our HEGEL achieves
state-of-the-art performance over mainstream EL methods.


2. RELATED WORK 
2.1 Entity Linking 


Most existing models use not only local methods, which rely on the local context of each mention
independently [18, 19, 20, 21, 22, 23], but also global methods, which consider the coherence among the
linked entities of all mentions by linking jointly over the whole document [9, 13]. Most local methods make
use of local features extracted through feature engineering, including pair-wise statistical features, such as
the Wikipedia linking frequency, and similarity scores between mentions and candidate entities, such as the
mention-entity similarities implemented as cosine similarities between the local document contexts and the
entity Wikipedia titles in [13]. Recently, Pretrained Language Models (PLMs), which achieve leading
performance in other natural language processing tasks, have also been used in local linking models. The
PLM-based linking models focus on particular settings, such as zero-shot [22] and multilingual [23]
scenarios, to exploit the strength of PLMs in understanding tasks under these settings. To alleviate the noise
introduced by the local information, global methods try to model the semantic coherence and relationships
between the linked entities within the same document. As the global coherence optimization problem is
NP-hard, different approximation methods are often used. Apart from traditional methods such as loopy
belief propagation [8, 11], several works recast the problem as a sequential decision problem [6] or as
graph learning [5, 7, 14, 16].


Following the graph-based neural network modeling methods, HEGEL extends the use of graphs in the
EL task to the heterogeneous setting. It not only enjoys the strong representation ability of the
heterogeneous graph structure, but is also efficient, because it avoids the additional inference steps
required in sequence-style models.


2.2 Graph Neural Networks 


The Graph Neural Network (GNN) is a strong and flexible framework for learning on graph-structured data.
Since the Graph Convolutional Network (GCN) [17] appeared, GNNs have been used more and more widely
in many tasks, and several popular GNN architectures, such as GraphSAGE [24] and GAT [25], have been
proposed to learn representations on graphs. The natural graph structure inherent in the EL task makes it
well suited to GNN methods for modeling the global information. NCEL [5] applies GCN to subgraphs
constructed for every mention, where the nodes are the entity candidates of the current mention and of the
surrounding mentions, with edges linked from the former to the latter. SGEL [7] combines the features of a
mention-by-mention sequential model and GAT by building a graph containing previously predicted entities,
current candidates and the candidates of later, not yet predicted mentions as nodes. GNED [16] builds a
homogeneous graph by embedding the entities and words into the same vector space, and extracts words
from the description and context in the KB for every candidate to form the nodes and edges.


With the emergence of massive heterogeneous information, many works on Heterogeneous GNNs (HGNNs)
have proved to be effective. Mainstream HGNN models are based on the construction of metapaths
[26], but several metapath-free HGNN architectures have been proposed recently [10]. Our HEGEL follows
these works and utilizes the heterogeneous structure to model the interactions among different types of
linking information.


3. PROBLEM FORMULATION 


Given a list of entity mentions $M = \{m_1, \ldots, m_{|M|}\}$ in a document $D$, the EL task can be formulated
as linking each mention $m_i$ to its corresponding entity $\hat{e}_i$ from the entity collection $\mathcal{E}$ of
the KB, or to NIL (i.e., $\hat{e}_i = \mathrm{NIL}$, which means that the mention $m_i$ cannot reasonably be
linked to any entity in $\mathcal{E}$). EL methods usually consist of two stages.


3.1 Candidate Generation Stage 


EL tasks generally start by generating a small list of candidate entities
$C_i = \{e_{i_1}, \ldots, e_{i_{|C_i|}}\} \subseteq \mathcal{E}$ for the mention $m_i$, because traversing the
whole entity collection $\mathcal{E}$ is computationally unacceptable. For candidate generation, we used the
method proposed in [8, 11], which relies on (1) the mention-entity prior $P(e \mid m)$, computed by averaging
probabilities from the mention-entity hyperlink statistics of Wikipedia; and (2) the local context-entity
similarity, calculated as the similarity between the candidate entity embedding and the average embedding
of the context words.


This stage aims to include the correct entity $\hat{e}_i$ in $C_i$, and the ratio of candidate lists that contain
the corresponding entity is referred to as the recall of candidate generation.


3.2 Candidate Disambiguation Stage 


In this stage, EL methods assign a score computed by the EL model to each candidate $e_{i_k}$ and select the
top-ranked candidate as the predicted answer, or predict NIL in certain specified situations. Most EL
methods, including this work, focus on improving performance at this stage. As mentioned in Section 2.1,
different local and global models are used to calculate the linking scores. Local methods focus on the
mention itself, regardless of other mentions or linked entities. That is, local methods solve the linking
problem independently for every mention:


$$e_i^* = \arg\max_{e_{i_k} \in C_i} \Psi_{\mathrm{local}}(e_{i_k}, m_i), \tag{1}$$


where $\Psi_{\mathrm{local}}$ is a scoring function for the mention-entity pair. Unlike local methods, which
involve no inter-mention interaction, global methods measure interdependencies through a coherence scoring
function that takes the entity topic coherence into account:


$$E^* = \arg\max_{E \in C_1 \times \cdots \times C_{|M|}} \Phi(E, D) + \sum_{i=1}^{|M|} \Psi_{\mathrm{local}}(e_i, m_i), \tag{2}$$


where $E^* = \{e_1^*, \ldots, e_{|M|}^*\}$ is the predicted entity list for the entity mentions $M$ of document $D$,
and $\Phi(E, D)$ is the global function measuring how the entities cohere with each other.
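

To make the difference between Equations (1) and (2) concrete, the following minimal Python sketch links a toy two-mention document in both ways. The score tables `local` and `coherence` are invented for illustration, the global function is assumed to be a sum of pairwise coherence scores, and the exhaustive search over the Cartesian product of candidate lists is only feasible for tiny inputs (real systems rely on the approximations discussed in Section 2.1).

```python
from itertools import product

# Hypothetical local scores Psi_local(e, m); the numbers are illustrative only.
local = {
    "England": {"England national football team": 0.6,
                "England national rugby union team": 0.4},
    "Cuttitta": {"Marcello Cuttitta": 1.0},
}

# Hypothetical symmetric pairwise coherence between entities.
coherence = {
    frozenset(["England national rugby union team", "Marcello Cuttitta"]): 0.5,
    frozenset(["England national football team", "Marcello Cuttitta"]): 0.0,
}


def link_local(local):
    """Equation (1): pick the best candidate for each mention independently."""
    return {m: max(cands, key=cands.get) for m, cands in local.items()}


def link_global(local, coherence):
    """Equation (2): brute-force search for the most coherent joint assignment."""
    mentions = list(local)
    best, best_score = None, float("-inf")
    for assignment in product(*(local[m] for m in mentions)):
        # Sum of local scores over all mentions.
        score = sum(local[m][e] for m, e in zip(mentions, assignment))
        # Phi(E, D) is modeled here as the sum of pairwise coherence scores.
        for i in range(len(assignment)):
            for j in range(i + 1, len(assignment)):
                pair = frozenset([assignment[i], assignment[j]])
                score += coherence.get(pair, 0.0)
        if score > best_score:
            best, best_score = dict(zip(mentions, assignment)), score
    return best
```

With these toy numbers, the local linker prefers the football team for “England”, while the global linker switches to the rugby union team because of its coherence with Marcello Cuttitta.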


4. THE PROPOSED APPROACH 


In addition to separately encoding the local features for every mention within a document as local models 
do, HEGEL constructs an informative heterogeneous graph for each document and then applies a 
heterogeneous GNN on it, which encodes the global coherence based on different types of information. 
Finally, HEGEL combines the local and global features and generates a final score for each mention- 
candidate pair. 


Figure 2 gives an overview of HEGEL that follows a four-stage processing pipeline: (a) encoding local 
features for each candidate independently, (b) informative graph construction for the document, (c) applying 
heterogeneous GNN on the graph, and (d) combining local and global features for scoring. 


chinaXiv:202211.00384v1
Integrating Manifold Knowledge for Global Entity Linking with Heterogeneous Graphs


[Figure 2 (image): panels (a) Encoding Local Features, (b) Informative Graph Construction and (c) Applying
HGNN. The example graph connects the mentions “Scotland”, “Murrayfield”, “Cuttitta” and “England” to
candidate entities (e.g., Scotland national rugby union team, Murrayfield Stadium, Marcello Cuttitta, England
national rugby union team, England national football team), KB neighbors (e.g., United Kingdom, Rugby union,
Edinburgh) and keywords (e.g., country, stadium, Italian, player, association, football).]


Figure 2. The overall framework of our proposed model HEGEL with a real experiment case. The blue nodes 
denote the mentions in one document; the orange nodes denote the candidate entities; the black nodes indicate 
the neighbors extracted from Wikipedia KB; and the green nodes indicate the keywords extracted from the first 
sentence of entities in Wikipedia. The heterogeneous graph in right part (c) can provide the discriminative linking 
information through the flow on the topological connections. 


4.1 Encoding Local Features 


Given a mention $m_i$ in $D$ and a candidate entity $e_{i_k} \in C_i$, HEGEL computes three types of local
features to encode the local mention-entity compatibility. These features consist of (a) the Mention-Entity
Prior $P(e_{i_k} \mid m_i)$, which has been used in the candidate generation stage, as described in Section 3.1;
(b) the Context Similarity $\Psi_C(e_{i_k}, C_{m_i})$, which utilizes an attention neural network to compute the
similarity between the candidate $e_{i_k}$ and the local context $C_{m_i} = \{w_1, \ldots, w_{|C_{m_i}|}\}$
surrounding $m_i$ by selecting the $K$ most relevant words from $C_{m_i}$, thereby eliminating noisy context
words from the computation:


$$u(w) = \max_{e_{i_k} \in C_i} v_{e_{i_k}}^{\top} A v_w, \tag{3}$$

$$\bar{C}_{m_i} = \{w \in C_{m_i} \mid u(w) \in \mathrm{topK}(u)\}, \tag{4}$$

$$\beta(w) = \frac{\exp(u(w))}{\sum_{w' \in \bar{C}_{m_i}} \exp(u(w'))}, \quad w \in \bar{C}_{m_i}, \tag{5}$$

$$\Psi_C(e_{i_k}, C_{m_i}) = \sum_{w \in \bar{C}_{m_i}} \beta(w)\, v_{e_{i_k}}^{\top} B v_w, \tag{6}$$


where $v_e$ and $v_w$ are the entity embeddings and word embeddings trained in [8], and the diagonal matrices
$A$ and $B$ are both trainable; and (c) the Type Similarity $\Psi_T(e_{i_k}, m_i)$, which estimates the similarity
between the types (PER, GPE, ORG and UNK) of $m_i$ and $e_{i_k}$ by training the typing system proposed in [15]:


$$\mathrm{Type}(m_i), \mathrm{Type}(e_{i_k}) \in \{\mathrm{PER}, \mathrm{GPE}, \mathrm{ORG}, \mathrm{UNK}\}, \tag{7}$$

$$v_{T_{m_i}} = \mathrm{Emb}_T(\mathrm{Type}(m_i)), \tag{8}$$

$$v_{T_{e_{i_k}}} = \mathrm{Emb}_T(\mathrm{Type}(e_{i_k})), \tag{9}$$

$$\Psi_T(e_{i_k}, m_i) = v_{T_{m_i}}^{\top} v_{T_{e_{i_k}}}, \tag{10}$$


where $\mathrm{Emb}_T(t)$ is the trainable type embedding for type $t$. As the mentions and entities use the same
embedding set, an $m_i$ and an $e_{i_k}$ with the same type will have a higher $\Psi_T$ than pairs with different types.
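

The context similarity of Equations (3) to (6) can be sketched in plain Python as follows. This is a minimal illustration, not the authors' implementation: the function name and argument layout are hypothetical, the diagonal matrices A and B are passed as vectors of their diagonal entries, and the attention weights are assumed to be a softmax over the K retained words, as in the attention mechanism of [8].

```python
import math


def context_similarity(entity_vecs, word_vecs, target, a_diag, b_diag, k):
    """Top-K hard attention over context words, then a weighted bilinear score.

    entity_vecs: dict candidate name -> embedding (list of floats)
    word_vecs:   list of context word embeddings
    target:      candidate whose context similarity is returned
    a_diag, b_diag: diagonals of the trainable matrices A and B
    k:           number of context words to keep
    """
    def bilinear(x, diag, y):
        # x^T diag(d) y for a diagonal matrix stored as a vector.
        return sum(xi * di * yi for xi, di, yi in zip(x, diag, y))

    # Eq. (3): a word's relevance is its best match against any candidate.
    u = [max(bilinear(e, a_diag, w) for e in entity_vecs.values())
         for w in word_vecs]
    # Eq. (4): keep only the K most relevant context words.
    kept = sorted(range(len(word_vecs)), key=lambda i: u[i], reverse=True)[:k]
    # Assumed normalization: softmax over the kept words.
    z = sum(math.exp(u[i]) for i in kept)
    beta = {i: math.exp(u[i]) / z for i in kept}
    # Eq. (6): attention-weighted bilinear similarity with the target candidate.
    e = entity_vecs[target]
    return sum(beta[i] * bilinear(e, b_diag, word_vecs[i]) for i in kept)
```

Because the relevance in Equation (3) takes the maximum over all candidates, the same K words are kept for every candidate of a mention; only the final weighted sum depends on the target candidate.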


4.2 Informative Heterogeneous Graph Construction 


For the document $D$, HEGEL builds an informative heterogeneous graph $G_D$ to collect different types of
linking clues.


As shown in Figure 2, $G_D = \langle V_D, E_D \rangle$ contains three types of nodes: mention nodes $V_{ment}$,
entity nodes $V_{ent}$ and keyword nodes $V_{word}$. Therefore, the node set is
$V_D = V_{ment} \cup V_{ent} \cup V_{word}$. $V_{ment}$ is naturally composed of all mentions $m_i$ in $D$.
$V_{ent}$ contains two parts: the mention candidates $V_{ent1} = \bigcup_{i=1}^{|M|} C_i$, where duplicates are
removed, and the common neighbors in the KB of at least two candidate entities in $V_{ent1}$, or formally
$V_{ent2} = \{v \mid \exists v_1, v_2 \in V_{ent1}, v_1 \neq v_2, (v_1, v) \in KB, (v_2, v) \in KB, v \notin V_{ent1}\}$.
As retaining all KB neighbors of $V_{ent1}$ is computationally unacceptable, we eliminate from $V_{ent}$ those
nodes with only one neighbor in $V_{ent1}$, because neighbors bridging two candidates are more informative for
determining the relation between candidates, which is theoretically explained and experimentally verified in
[12, 27]. $V_{word}$ consists of the keywords extracted from the Wikipedia page of each candidate in $V_{ent1}$.
We found that the first sentence on the Wikipedia page of an entity usually contains more fine-grained type
information of the entity, which is a very useful linking clue. Therefore, for each $e$ in $V_{ent1}$, we extracted
the first sentence $s$ from its Wikipedia page, found the first linking verb in $s$, and picked the continuous
phrase immediately after the linking verb that contains only nouns, adjectives and conjunctions. We regarded
the words in the picked phrase, except stopwords, as keywords characterizing the fine-grained type of $e$, and
added them to $V_{word}$.
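

The selection of bridging neighbors described above can be sketched as follows. This is a minimal illustration under stated assumptions: the function name and the dictionary-based KB interface are hypothetical, and entities that are themselves candidates are assumed to be excluded from the neighbor set.

```python
from collections import Counter


def common_neighbors(candidates, kb_neighbors):
    """Return KB neighbors shared by at least two distinct candidate entities.

    candidates:   iterable of candidate entity ids (all mentions of one document)
    kb_neighbors: dict entity id -> set of neighbor entity ids in the KB
    """
    candidates = set(candidates)  # duplicates across mentions are removed
    counts = Counter()
    for c in candidates:
        # Assumption: candidates themselves are not counted as neighbor nodes.
        for n in kb_neighbors.get(c, set()) - candidates:
            counts[n] += 1
    # Keep only neighbors that bridge two or more candidates.
    return {n for n, k in counts.items() if k >= 2}
```

Counting how many candidates point to each neighbor, rather than intersecting neighbor sets pairwise, keeps the cost linear in the total number of candidate-neighbor pairs.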


After the node set $V_D$ is generated, HEGEL creates heterogeneous edges between nodes of the same or
different types according to the following rules: (a) the edges between two mention nodes,
$E_{mm} \subseteq V_{ment} \times V_{ment}$, are created between adjacent mentions $(m_i, m_{i+1})$ in $D$;
(b) the edges between two entity nodes, $E_{ee} \subseteq V_{ent} \times V_{ent}$, are created when there is a
relation between them in the KB; (c) the edges between two word nodes,
$E_{ww} \subseteq V_{word} \times V_{word}$, are created when the cosine similarity of the two word embeddings
is higher than a given threshold $\epsilon$; (d) the edges from entities to mentions,
$E_{em} \subseteq V_{ent} \times V_{ment}$, follow the mention-candidate relation; (e) the edges from words to
entities, $E_{we} \subseteq V_{word} \times V_{ent}$, are created when the word is one of the keywords of the
entity. Note that (d) and (e) are uni-directional while (a)-(c) are bi-directional; the performance of
constructing bi-directional edges for (d) and (e) will be discussed later. In short, the entire edge set can
be represented as $E_D = E_{mm} \cup E_{ee} \cup E_{ww} \cup E_{em} \cup E_{we}$.
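

Rules (a) to (e) can be sketched as a small edge builder. This is only an illustration: the function and argument names are hypothetical, edges are stored as directed (source, target) pairs so that bi-directional rules add both directions, and the KB relations are assumed to be already restricted to entity nodes of the graph.

```python
import math


def cosine(x, y):
    """Cosine similarity of two vectors; 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(x, y))
    norm = math.sqrt(sum(a * a for a in x)) * math.sqrt(sum(b * b for b in y))
    return dot / norm if norm else 0.0


def build_edges(mentions, candidates, kb_relations, keywords, word_vecs, eps):
    """Return a dict edge type -> set of directed (source, target) pairs.

    mentions:     mention ids in document order
    candidates:   dict mention -> list of candidate entities
    kb_relations: set of (entity, entity) pairs related in the KB
    keywords:     dict entity -> list of keyword strings
    word_vecs:    dict keyword -> embedding
    eps:          cosine threshold for word-word edges
    """
    edges = {t: set() for t in ("mm", "ee", "ww", "em", "we")}
    for a, b in zip(mentions, mentions[1:]):      # rule (a), bi-directional
        edges["mm"] |= {(a, b), (b, a)}
    for a, b in kb_relations:                     # rule (b), bi-directional
        edges["ee"] |= {(a, b), (b, a)}
    words = sorted(word_vecs)
    for i, a in enumerate(words):                 # rule (c), bi-directional
        for b in words[i + 1:]:
            if cosine(word_vecs[a], word_vecs[b]) > eps:
                edges["ww"] |= {(a, b), (b, a)}
    for m, cands in candidates.items():           # rule (d), entity -> mention
        edges["em"] |= {(e, m) for e in cands}
    for e, kws in keywords.items():               # rule (e), word -> entity
        edges["we"] |= {(w, e) for w in kws}
    return edges
```

Keeping one edge set per type makes it straightforward to apply a separate transformation to each relation later on.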


4.3 Heterogeneous Graph Neural Network 


Given a constructed heterogeneous informative graph $G_D$, HEGEL applies a dedicated heterogeneous
graph neural network (HGNN) to it to integrate the different sources of manifold information and encourage
the interactions among them, generating information-augmented embeddings of $V_{ment}$ and $V_{ent}$ for later
candidate scoring and ranking.


To avoid the need for expert knowledge and the information loss incurred by earlier metapath-based HGNN
methods, we designed a novel metapath-free HGNN model. For the heterogeneous graph
$G_D$, we represent an edge $e \in E_D$ from node $i \in V_D$ to node $j \in V_D$ with edge type $r$ as
$(i, j, r)$. Note that in our informative graph, the node type pair $(t_i, t_j)$ uniquely determines the edge
type $r$, and we therefore denote $(t_i, t_j)$ as $r$ in the following explanation.


4.3.1 Node Embeddings 


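For a mention node $v_{m_i} \in V_{ment}$, we used a text convolutional neural network (CNN) on the local
context $C_{m_i}$ surrounding $m_i$ to compute the initial embedding $h^0_{m_i} \in \mathbb{R}^{d_{cnn} + d_0}$:


$$\bar{v}_{m_i} = \frac{1}{\mathrm{len}(m_i)} \sum_{w \in m_i} v_w, \tag{11}$$

$$h^0_{m_i} = [\bar{v}_{m_i}; \mathrm{CNN}(v_{w_1}, \ldots, v_{w_{|C_{m_i}|}})], \tag{12}$$


where $v_w, v_{w_j} \in \mathbb{R}^{d_0}$ are the word embeddings of the mention surface words and of the words in
the mention's context $C_{m_i}$, respectively, and $[\cdot\,;\cdot]$ is the concatenation operation. For the nodes in
$V_{ent}$ and $V_{word}$, we naturally used the entity and word embeddings $v_e \in \mathbb{R}^{d_0}$ and
$v_w \in \mathbb{R}^{d_0}$ trained in [8] as the initial embeddings $h^0_e$ and $h^0_w$.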


4.3.2 Inter-Node Propagation 


A node should receive different types of information from its heterogeneous neighborhood in different
ways. Motivated by previous work on metapaths [26], HEGEL models the different types of information
propagation with multiple feature transformations on the different adjacency relations. Taking the edge type
$r = (t_i, t_j)$ into consideration, a node $v_i$ with type $t_i$ collects information from its neighborhood
$N_{t_j}(v_i)$ with type $t_j$ in the $l$-th layer through a Graph Convolutional Network (GCN):


$$h^{l}_{i,t_j} = \frac{1}{Z} \sum_{v_j \in N_{t_j}(v_i)} W^{l}_{t_i,t_j} h^{l-1}_{j}, \tag{13}$$


where $h^{l-1}_{j} \in \mathbb{R}^{d^{l-1}_{t_j}}$ is $v_j$'s embedding before the $l$-th layer,
$W^{l}_{t_i,t_j} \in \mathbb{R}^{d^{l}_{t_i} \times d^{l-1}_{t_j}}$ is a trainable matrix in the $l$-th layer,
$h^{l}_{i,t_j} \in \mathbb{R}^{d^{l}_{t_i}}$ is $v_i$'s new embedding related to $t_j$, and $Z$ is the normalization
factor. Note that for edge types $(t_i, t_i)$ connecting nodes of the same type, self-loop connections are added
to the edge set.


4.3.3 Intra-Node Aggregation 


To preserve the information from the different types of relationships with its neighborhood, for the node
$v_i$ HEGEL aggregates the new embeddings to generate the input $h^{l}_{i}$ for the next layer:


$$h^{l}_{i} = \sigma(f_a(\{h^{l}_{i,t_j}\})), \tag{14}$$


where $f_a : \mathbb{R}^{|\{t_j\}| \times d^{l}_{t_i}} \to \mathbb{R}^{d^{l}_{t_i}}$ is the aggregation function that
transforms the $|\{t_j\}|$ input embeddings into a single aggregated one, implemented as a simple summation
$f_a(\{x_j\}) = \sum_j x_j$; $\sigma$ is an activation function implemented as GELU(·) [28]; and
$h^{l}_{i} \in \mathbb{R}^{d^{l}_{t_i}}$ is the output embedding of the $l$-th layer, which covers all types of one-hop
neighborhood of $v_i$ in the heterogeneous graph structure. As all types of neighborhoods can
affect the output of the current layer and consequently the information propagation in the next layer, we
believe that the $L$ layers of inter-node propagation and intra-node aggregation encourage heterogeneous
integration and full interaction among the types of information, which are represented by the final output
$h^{L}_{i}$.
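

One layer combining the inter-node propagation of Equation (13) with the intra-node aggregation of Equation (14) can be sketched in plain Python as follows. This is an illustrative sketch rather than the authors' code: the function names are hypothetical, the normalization factor Z is assumed to be the number of neighbors of the given type, and the caller is assumed to have added self-loops for same-type relations.

```python
import math


def gelu(x):
    # Exact GELU using the Gaussian error function.
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))


def matvec(w, x):
    # Multiply a matrix (list of rows) by a vector.
    return [sum(wij * xj for wij, xj in zip(row, x)) for row in w]


def hgnn_layer(h, node_type, in_edges, weights):
    """One metapath-free heterogeneous layer.

    h:         dict node -> embedding before the layer
    node_type: dict node -> type name
    in_edges:  dict node -> list of nodes it receives messages from
    weights:   dict (receiver type, sender type) -> weight matrix
    """
    out = {}
    for i, neighbors in in_edges.items():
        # Group the incoming neighbors by their node type.
        by_type = {}
        for j in neighbors:
            by_type.setdefault(node_type[j], []).append(j)
        total = None
        for t_j, group in by_type.items():
            w = weights[(node_type[i], t_j)]
            # Eq. (13): mean of the transformed embeddings of type-t_j neighbors.
            msgs = [matvec(w, h[j]) for j in group]
            mean = [sum(col) / len(group) for col in zip(*msgs)]
            # Eq. (14), first part: sum the per-type embeddings.
            total = mean if total is None else [a + b for a, b in zip(total, mean)]
        if total is not None:
            # Eq. (14), second part: apply the GELU activation.
            out[i] = [gelu(x) for x in total]
    return out
```

Using a separate weight matrix per (receiver type, sender type) pair is what distinguishes this layer from a homogeneous GCN, where a single matrix would be shared by all relations.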


4.3.4 Global Score Calculation 


After obtaining the information-augmented embeddings $h^{L}_{m_i}$ for the mention $m_i$ and $h^{L}_{e_{i_k}}$ for
the corresponding candidate $e_{i_k}$, where we ensure that $d^{L}_{ment} = d^{L}_{ent}$, HEGEL applies a bi-linear
similarity calculation to represent the global compatibility of the mention-candidate pair:


$$\Psi_G(e_{i_k}, m_i) = (h^{L}_{m_i})^{\top} D\, h^{L}_{e_{i_k}}, \tag{15}$$


where $D \in \mathbb{R}^{d^{L}_{ment} \times d^{L}_{ent}}$ is a trainable diagonal matrix.


4.4 Feature Combining and Model Training 


HEGEL combines the local features and the global compatibility score to compute the linking score for each
candidate $e_{i_k}$ of mention $m_i$:


$$S(m_i, e_{i_k}) = f([P(e_{i_k} \mid m_i); \Psi_C(e_{i_k}, C_{m_i}); \Psi_T(e_{i_k}, m_i); \Psi_G(e_{i_k}, m_i)]), \tag{16}$$


where $f$ is a two-layer fully connected neural network. The candidate $e_{i_k}$ with the highest final linking
score $S(m_i, e_{i_k})$ is selected as the linking result for $m_i$. HEGEL links $m_i$ to NIL if and only if its
candidate list $C_i = \emptyset$, in other words, when there is no entity corresponding to $m_i$ in the KB entity
set $\mathcal{E}$.


Following previous works, HEGEL attempts to rank the ground-truth entity $\hat{e}_i$ higher than the other
candidates, and therefore minimizes the following margin-based ranking loss:


$$\mathcal{L} = \sum_{m_i \in D} \sum_{e_{i_k} \in C_i} [\gamma - S(m_i, \hat{e}_i) + S(m_i, e_{i_k})]_+, \tag{17}$$


where $\gamma > 0$ is the margin hyper-parameter, and $[x]_+$ is equal to $x$ when $x > 0$, and equal to 0 otherwise.
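

As a small worked illustration of the margin-based ranking loss in Equation (17), the sketch below evaluates it for one document in plain Python. The function name and the nested-dictionary layout are hypothetical, and skipping the comparison of the ground-truth candidate with itself is an assumption (including it would only add the constant margin for each mention).

```python
def ranking_loss(scores, gold, gamma=0.01):
    """Margin-based ranking loss over one document.

    scores: dict mention -> dict candidate -> linking score S(m, e)
    gold:   dict mention -> ground-truth entity (must be among the candidates)
    gamma:  margin; 0.01 is the value reported in Section 5.3
    """
    loss = 0.0
    for m, cand_scores in scores.items():
        gold_score = cand_scores[gold[m]]
        for e, s in cand_scores.items():
            if e == gold[m]:
                continue  # assumption: the gold candidate is not compared with itself
            # Hinge term: positive only if e scores within gamma of the gold entity.
            loss += max(0.0, gamma - gold_score + s)
    return loss
```

A wrong candidate contributes to the loss only when its score comes within the margin of the ground-truth score, so candidates that are already ranked clearly lower produce no gradient.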


5. EXPERIMENTS AND ANALYSIS 
5.1 Datasets 


Following previous EL practice, we evaluated HEGEL on the benchmark dataset AIDA CoNLL-YAGO [19] 
for training, validation and the in-domain testing. To examine its cross-domain generalization ability, we 
used five popular datasets for cross-domain testing: MSNBC [29], AQUAINT [30], ACE2004 [13], CWEB 
[9] and WIKIPEDIA [9]. Table 1 shows the statistics and corresponding recall of candidate generation of all 
datasets used in our experiments. 


Table 1. The statistics of used datasets. 


Dataset #Mentions #Docs #Ments / #Docs Recall (%) 
AIDA-train (train) 18448 946 19.50 100 
AIDA-A (valid) 4791 216 22.18 97.72 
AIDA-B (test) 4485 231 19.40 98.66 
MSNBC 656 20 32.80 98.48 
AQUAINT 727 50 14.54 94.09 
ACE2004 257 36 7.14 91.44 
CWEB 11154 320 34.86 91.90 
WIKIPEDIA 6821 320 21.32 93.21 


Note: Recall represents the ratio of ground truth entities appearing in the generated candidate lists of corresponding mentions in 
the datasets. 


5.2 Model Variant 


To examine our claim that the heterogeneous nature of the GNN plays a crucial role in HEGEL, we
implemented a semi-heterogeneous version of HEGEL, called HEGEL-semi, in which the GCN parameters are
shared across the different node types in every layer except the first one. The first layer remains
heterogeneous because the input node embeddings have different dimensions and cannot be processed in a
non-heterogeneous way:


$$h^{1}_{i,t_j} = \frac{1}{Z} \sum_{v_j \in N_{t_j}(v_i)} W^{1}_{t_i,t_j} h^{0}_{j}, \tag{18}$$

$$h^{1}_{i} = \sigma(f_a(\{h^{1}_{i,t_j}\})), \tag{19}$$

$$h^{l}_{i} = \sigma\Big(\frac{1}{Z} \sum_{v_j \in N(v_i)} W^{l} h^{l-1}_{j}\Big), \quad l \geq 2. \tag{20}$$


As the $L-1$ parameter-sharing layers do not use different parameters for different types of nodes,
they do not benefit from the heterogeneous graph structure. Therefore, according to our claim about the
effect of a heterogeneous method, the performance of HEGEL-semi should be lower than that of HEGEL.


5.3 Experiment Settings 


As we used the pre-trained Word2vec [31] word embeddings and the entity embeddings released by [8],
the embedding dimension $d_0$ is fixed to 300. The hyper-parameters are manually tuned based on the
validation performance on AIDA-A: the CNN output dimension $d_{cnn} = 64$, all informative graph embedding
dimensions $d^{l} = 32$ for $l = 1, \ldots, L$, the number of HGNN layers $L = 2$, the margin $\gamma = 0.01$,
$K = 40$, the dropout rate 0.5, and the $E_{ww}$ threshold $\epsilon = 0.5$. To confine the graph size within a
computable range, every document with more than 80 mentions is split into several documents of sizes as even as possible.
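

The document splitting rule can be sketched as follows. This is one plausible reading, not the authors' procedure: the function name is hypothetical, and the sketch assumes that the mention list is cut into the smallest number of contiguous chunks of at most 80 mentions whose sizes differ by at most one.

```python
import math


def split_mentions(mentions, max_size=80):
    """Split a mention list into the fewest contiguous chunks of at most
    max_size mentions, with chunk sizes differing by at most one."""
    n = len(mentions)
    if n <= max_size:
        return [mentions]
    parts = math.ceil(n / max_size)
    base, extra = divmod(n, parts)
    chunks, start = [], 0
    for p in range(parts):
        # The first `extra` chunks take one additional mention.
        size = base + (1 if p < extra else 0)
        chunks.append(mentions[start:start + size])
        start += size
    return chunks
```

For example, a document with 170 mentions would be split into three chunks of 57, 57 and 56 mentions instead of 80, 80 and 10.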


We used the Adam optimizer to train HEGEL with a learning rate of $\alpha = 2 \times 10^{-4}$. The model is
evaluated every 3 epochs, and training is terminated when the best validation performance has not been
exceeded for 10 consecutive evaluations.
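

The stopping criterion can be sketched as a generic training loop. This is a schematic illustration only: `train_epoch` and `evaluate` are hypothetical callables supplied by the caller, and `max_epochs` is an added safety bound that is not part of the described setup.

```python
def train_with_early_stopping(train_epoch, evaluate, eval_every=3,
                              patience=10, max_epochs=1000):
    """Train until the validation score has not improved for `patience`
    consecutive evaluations; return the best score and the last epoch."""
    best, stale, epoch = float("-inf"), 0, 0
    while epoch < max_epochs and stale < patience:
        epoch += 1
        train_epoch()
        if epoch % eval_every == 0:
            score = evaluate()
            if score > best:
                best, stale = score, 0   # new best: reset the patience counter
            else:
                stale += 1               # no improvement at this evaluation
    return best, epoch
```

With evaluation every 3 epochs and a patience of 10, training continues for at most 30 epochs after the last improvement on the validation set.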


HEGEL-semi is implemented under the same settings as HEGEL, because these settings also achieve its best
performance on AIDA-A.


5.4 Compared Baselines 


To illustrate the effect of modeling the interactions among different types of information, we evaluated 
and compared the performance of our HEGEL with nine existing methods on in- and cross-domain datasets: 


• AIDA [19] builds a graph whose weights are coherence scores and similarities, and applies a traditional
statistical method to it.

• GLOW [13] designs several local and global statistical features based on the Wikipedia linking structure.

• RI [18] provides an Integer Linear Programming (ILP) formulation of Wikification and incorporates
the entity-relation inference problem.

• WNED [9] builds disambiguation graphs and applies iterative random walks to them based on Information
Theory.

• Deep-ED [8] leverages learned neural representations based on local context windows for joint
document-level entity linking.

• Ment-Norm [11] treats relations between entities as latent variables and exploits them on top of Deep-ED [8].

• GNED [16] applies GCN and CRF to a homogeneous graph with extracted words and entities as nodes.

• NCEL [5] applies GCN to a bipartite graph to integrate both local contextual features and global information.

• SGEL [7] builds a graph for every mention sequentially, containing previously linked entities and
the candidates of not yet predicted mentions.


It is worth noting that GNED claims to be the first to construct a heterogeneous entity-word graph to model
global information, but its nodes are not actually heterogeneous, as the entity nodes share the same vector
space with the words. In addition, GNED does not apply any heterogeneous architecture in its GNN, as it
regards all edges as the same type. Therefore, to the best of our knowledge, HEGEL is the first work to
employ a heterogeneous GNN in EL tasks.


5.5 Main Results 


We report the performance of all the compared baselines and our HEGEL in Table 2. The top part shows 
the performance of non-GNN-based baselines, and other baselines are GNN-based. 


Table 2. Performance on in-domain (AIDA-B) and cross-domain datasets. 


In-domain Cross-domain 

Models AIDA-B MSNBC AQUA ACE CWEB WIKI 
Prior p(e|m) ES 89.3 83.2 84.4 69.8 64.2 
AIDA - 79 56 80 58.6 63 
GLOW - 75 83 82 56.2 67.2 
RI - 90 90 86 67.5 73.4 
WNED 89 92 87 88 77 84.5 
Deep-ED Vy 93.7 88.5 88.5 77.9 77.5 
Ment-Norm 93.07 93.9 88.3 89.9 77.5 78.0 
GNED 92.40 95.5 91.6 90.14 77.5 78.5 
NCEL 80 - 87 88 - - 
SGEL 83 80 88 89 - - 
HEGEL 93.65±0.1 93.19±0.2 85.87±0.3 89.33±0.4 73.25±0.3 75.54±0.1
- w/o V_word 91.94±0.2 93.18±0.2 85.35±0.4 88.40±0.5 71.95±0.5 74.70±0.2
- w/o V_ent2 92.22±0.1 92.93±0.4 85.07±0.7 88.93±0.4 72.57±0.5 75.45±0.2
HEGEL-semi 92.37±0.2 92.01±0.7 85.82±0.5 89.19±0.5 72.63±0.5 75.23±0.2
Local 91.03 91.97 84.06 86.92 71.45 74.79 


Note: We show in-KB accuracy (%) for the in-domain dataset and micro-F1 score (%) for the cross-domain
datasets. For HEGEL we show the standard deviation obtained over 3 runs.


The in-domain test dataset AIDA-B, which shares a similar data distribution with the training dataset
AIDA-train and the validation dataset AIDA-A, is the most important benchmark. By modeling the latent
relations between mentions and injecting entity coherence into them, which can be regarded as a simple
interaction between two types of information, Ment-Norm outperforms all other baselines on AIDA-B. This
shows that interactions of heterogeneous information are beneficial for capturing global coherence. HEGEL,
which integrates manifold linking knowledge in a more interactive and effective way for capturing
the global coherence, outperforms Ment-Norm on AIDA-B (93.65 vs. 93.07 in-KB accuracy). This indicates
that HEGEL encourages richer interactions among different types of information and thereby improves performance.


It should be pointed out that none of the models consistently achieves the best F1 score on all five
cross-domain datasets. HEGEL outperforms the other two GNN-based methods, NCEL and SGEL, on
MSNBC and ACE2004, which shows that HEGEL can handle cross-domain linking cases better than them
to some extent.


HEGEL performs very well on in-domain cases by making full use of different types of linking
clues to better capture the global coherence, but it shows no advantage on the cross-domain
datasets. We found that the ground-truth entities of the cross-domain test sets are less popular, so their
linking clues are sparse. To improve the generalization ability on such tough cases, the only effective way
seems to be introducing a large-scale corpus for training, so that the model can more or less “see” the
linking clues of cross-domain entities at the training stage. In the future, we will try to introduce
large-scale pre-trained language models, such as BERT, to improve the generalization ability of HEGEL.


As HEGEL-semi is also implemented with L = 2, it contains one heterogeneous layer and one 
parameter-sharing layer. The results in Table 2 show that although HEGEL-semi outperforms the local 
model, its lack of heterogeneous information propagation in the second layer leads to an obvious drop in 
performance compared with HEGEL. The heterogeneous GNN is therefore important for HEGEL to achieve 
good performance. 


Our method simply extracts keywords from the first sentence of the Wikipedia page of each entity. GNED, 
in contrast, searches the whole Wikipedia KB for hyperlinks to the corresponding entity and extracts their 
contexts in the preprocessing stage, which requires iterating over all |E| entities and is very 
time-consuming. Even with less keyword evidence, our strategy still outperforms GNED on the in-domain 
dataset with lower time overhead. GNED has access to more linking clues and reaches better performance on 
the cross-domain datasets; we suppose that richer information could likewise improve the generalization 
ability of HEGEL and further boost its performance on the cross-domain datasets. 


5.6 Ablation Study 


As shown in the bottom part of Table 2, HEGEL improves over the local model by 1.77% on average, 
which shows that HEGEL is able to greatly enhance the local model. 
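
The 1.77% figure can be checked against numbers reported in this section. The sketch below assumes that the six Local scores in Table 2 are ordered as AIDA-B followed by the five cross-domain datasets (the column headers are not repeated next to that row), and it takes HEGEL's AIDA-B accuracy and cross-domain average from Table 3. The variable names are illustrative.

```python
# Local-model scores from the last row of Table 2.
# Assumption: the first value is AIDA-B and the remaining five are the
# cross-domain datasets.
local_scores = [91.03, 91.97, 84.06, 86.92, 71.45, 74.79]

# HEGEL scores as reported in Table 3.
hegel_aida_b = 93.65
hegel_cross_avg = 83.44

local_aida_b = local_scores[0]
local_cross = local_scores[1:]
local_cross_avg = sum(local_cross) / len(local_cross)

# Average gain over all six datasets: one in-domain gain plus five
# cross-domain gains (their sum equals 5 * the difference of the averages).
total_gain = (hegel_aida_b - local_aida_b) + len(local_cross) * (
    hegel_cross_avg - local_cross_avg
)
average_gain = total_gain / len(local_scores)

print(f"Local cross-domain avg: {local_cross_avg:.2f}")
print(f"Average gain over six datasets: {average_gain:.2f}")
```

Under this column assignment the computed average gain rounds to 1.77, in line with the reported value. Because the HEGEL cross-domain average is itself rounded, the last digit should be read as approximate.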
To further examine the effect of our heterogeneous model, we removed the keyword nodes (V_word) and the 
neighbor nodes from the graph in turn, together with their related edges. In both cases there is a 
significant drop in performance (0.89% and 0.61% on average, respectively) across datasets, especially 
on the in-domain AIDA-B (1.71% and 1.43%). The results demonstrate the effectiveness of introducing the 
keyword (fine-grained type) information and the neighborhood information of candidate entities and of 
modeling the interactions among them, which helps to accurately capture the topical characteristics of candidates. 


5.7 Analysis 
5.7.1 The Impact of Edge Directions 


As mentioned in Section 4.2, HEGEL keeps only one direction for the edges between mentions and candidate 
entities and for the edges between keywords and candidate entities. We suppose that adding edges from 
V_ent to V_men and from V_ent to V_word would lead to the over-smoothing problem: the candidates to be 
disambiguated are related to the same mention, and possibly to the same keywords, so they might become 
entangled with each other and make disambiguation harder. As expected, the results in Table 3 show that 
keeping these edges uni-directional alleviates over-smoothing and improves performance. 
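
To make this design choice concrete, the following is a minimal sketch of building typed, directed edges for one document graph. The function names `build_edges` and `shared_receivers`, the node identifiers, and the `bidirectional` flag are hypothetical and only illustrate the idea; this is not the authors' implementation.

```python
from collections import defaultdict


def build_edges(mention_candidates, entity_keywords, bidirectional=False):
    """Return a list of (source, target, edge_type) triples.

    mention_candidates: dict mapping a mention to its candidate entities.
    entity_keywords:    dict mapping an entity to its keywords.
    With bidirectional=False, messages only flow INTO candidate entities
    (mention -> entity, keyword -> entity).
    """
    edges = []
    for mention, candidates in mention_candidates.items():
        for entity in candidates:
            edges.append((mention, entity, "men->ent"))
            if bidirectional:
                edges.append((entity, mention, "ent->men"))
    for entity, keywords in entity_keywords.items():
        for word in keywords:
            edges.append((word, entity, "word->ent"))
            if bidirectional:
                edges.append((entity, word, "ent->word"))
    return edges


def shared_receivers(edges):
    """Map each mention/keyword node to the entities that send messages to it."""
    incoming = defaultdict(set)
    for source, target, edge_type in edges:
        if edge_type.startswith("ent->"):
            incoming[target].add(source)
    return dict(incoming)


# Toy example: two candidates of the same mention share one keyword.
mention_candidates = {"m:England": ["e:England_football", "e:England_rugby"]}
entity_keywords = {
    "e:England_football": ["w:national", "w:football"],
    "e:England_rugby": ["w:national", "w:rugby"],
}

uni = build_edges(mention_candidates, entity_keywords)
bi = build_edges(mention_candidates, entity_keywords, bidirectional=True)

print(len(uni), len(bi))
print(shared_receivers(uni))
print(sorted(shared_receivers(bi)["m:England"]))
```

In the bidirectional variant, the mention node and the shared keyword node both receive messages from the two competing candidates and can pass that mixture back to them in the next layer, which is the kind of entanglement the paragraph above describes.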


Table 3. Experiment results on changing the directionality of edges. 


Models              AIDA-B    Cross-domain avg. 
HEGEL               93.65     83.44 
+ V_ent → V_word    93.15     82.93 
+ V_ent → V_men     92.64     83.03 
+ Both              91.21     81.81 


5.7.2 The Impact of the Number of GNN Layers 


Despite the powerful ability of GNNs to process graph-structured data, most of them are shallow, that is, 
they do not have many propagation layers. As shown in [32], stacking many layers with non-linear 
functions degrades the performance of GNN-based models due to the over-smoothing problem. 
Therefore, we examined the performance of HEGEL with different numbers of layers. The results in 
Figure 3 agree with previous GNN-related work: HEGEL with K = 2 layers achieves the best performance 
on the EL task. Too many layers lead to the over-smoothing problem, while a 1-layer model is not enough to 
propagate the heterogeneous information required for the aggregation and interaction on the graph. 
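
The over-smoothing effect can be illustrated on a toy graph. The sketch below is not HEGEL's HGNN: it assumes a plain, parameter-free mean aggregation over each node and its neighbours on a hypothetical four-node path graph, and it tracks how the spread between node features shrinks as layers are stacked.

```python
def mean_aggregate(features, neighbors):
    """One propagation step: each node averages itself and its neighbours."""
    updated = []
    for node, own in enumerate(features):
        group = [own] + [features[n] for n in neighbors[node]]
        updated.append(sum(group) / len(group))
    return updated


def spread(features):
    """Gap between the largest and smallest node feature."""
    return max(features) - min(features)


# Hypothetical path graph 0 - 1 - 2 - 3 with scalar node features.
neighbors = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
features = [1.0, 0.0, 0.0, -1.0]

spreads = [spread(features)]
for layer in range(1, 9):
    features = mean_aggregate(features, neighbors)
    spreads.append(spread(features))
    print(f"layers={layer}  spread={spreads[-1]:.4f}")
```

The spread decreases with every additional layer, so after enough steps the nodes become nearly indistinguishable. Learned weights and non-linearities change the details, but this is the basic tendency that limits the useful depth of such models.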
Figure 3. The performance with different numbers of layers and residual connection. The cross-domain results are 
average F1 scores on the five cross-domain datasets. 


To alleviate the over-smoothing problem when training deeper GNNs, we also tried a variant that uses 
residual connections [33] between the hidden layers of the GNN, which facilitate information retention 
through deeper models [17]. The residual connection enables HEGEL to carry over the heterogeneous 
information from the input embeddings of the previous layer by modifying Equation (14): 

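h_v^{(l+1)} = h_v^{(l)} + \tilde{h}_v^{(l+1)}    (21) 

where \tilde{h}_v^{(l+1)} denotes the output of Equation (14) for node v and h_v^{(l)} is its input embedding 
from the previous layer. 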

However, as shown in Figure 3, applying the residual connection to the model with K = 2 lowers both 
in-domain and cross-domain performance. Although the residual connection boosts in-domain performance 
for K > 3, the results are still not comparable with the best performance at K = 2. We suspect that this 
is related to how information handling varies across the layers of the HGNN: the heterogeneous structure 
at different propagation steps may be too different to be handled correctly by the same network layer. 
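
A residual connection of this kind can be sketched generically as adding a layer's input back to its output. The code below assumes a stand-in `mean_layer` function (simple mean aggregation on a hypothetical three-node graph) in place of the HGNN layer of Equation (14), so it illustrates the mechanism rather than HEGEL's actual update.

```python
def mean_layer(features, neighbors):
    """Stand-in for a GNN layer: average each node with its neighbours."""
    out = []
    for node, own in enumerate(features):
        group = [own] + [features[n] for n in neighbors[node]]
        out.append(sum(group) / len(group))
    return out


def propagate(features, neighbors, num_layers, residual=False):
    """Stack num_layers layers, optionally adding each layer's input back."""
    hidden = list(features)
    for _ in range(num_layers):
        update = mean_layer(hidden, neighbors)
        if residual:
            # Residual connection: carry the previous layer's embedding over.
            hidden = [h + u for h, u in zip(hidden, update)]
        else:
            hidden = update
    return hidden


# Hypothetical path graph 0 - 1 - 2 with scalar node features.
neighbors = {0: [1], 1: [0, 2], 2: [1]}
features = [1.0, 0.0, -1.0]

plain = propagate(features, neighbors, num_layers=1)
with_res = propagate(features, neighbors, num_layers=1, residual=True)
print(plain)
print(with_res)
```

With `residual=True`, each node keeps its own previous embedding in addition to the aggregated update, which is what lets deeper stacks retain node-specific information.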


5.7.3 Error Analysis 


We randomly sampled and analyzed 100 incorrectly linked mentions each from the in-domain dataset AIDA-B 
and from CWEB, the most difficult cross-domain dataset. As shown in Table 4, there are four major error 
types: (1) Topic Errors, which occur when HEGEL links a mention to a candidate whose topic differs from 
(and is usually unrelated to) that of the gold entity, are the main challenge faced by current global 
methods; (2) Similar Entity Errors, in which the predicted candidate and the gold entity are too similar in 
semantics to be disambiguated by local and global information, might be solved by introducing more 
information in future work; (3) Related Entity Errors occur when the predicted entity is closely 
semantically related to the gold one, such as a city and a stadium located in it, or a hypernym of the 
gold entity; (4) Dataset Annotation Errors, in which the gold entity given in the dataset is wrong and 
differs from the predicted one, occur only in CWEB. 
Table 4. The major error types and their examples. 


Error types                  Examples 
Topic Errors                 ... To win [Timbo]’s trust, Chris chained himself up in the elephant’s 
                             enclosure ... 
  AIDA-B: 24%                HEGEL → Timbo (a town in Guinea) 
  CWEB: 44%                  Gold → Timbaland (an American musician) 
Similar Entity Errors        ... In a gloomy Geneva conference centre built before the dawn of the 
                             [Internet], groups of staid officials made a ... 
  AIDA-B: 29%                HEGEL → Internet (the worldwide computer network) 
  CWEB: 24%                  Gold → World Wide Web (the global system of pages via URL) 
Related Entity Errors        ... a small rightwing [Christian] civil war militia, Saqr, whose trial 
                             was concurrent ... 
  AIDA-B: 47%                HEGEL → Christian 
  CWEB: 29%                  Gold → Catholicism (the largest Christian church) 
Dataset Annotation Errors    ... Brooks Cole Herring. [B.], 2001, Ethical guidelines in the treatment 
                             of compulsive ... 
  AIDA-B: 0%                 HEGEL → B.W. Aston (a Texas historian and professor) 
  CWEB: 3%                   Gold → B (the second letter ...) 


Note: Square brackets denote the current target mentions. Italicized and underlined entities are the prediction results of HEGEL 
and the gold entities given in datasets, respectively. 


5.7.4 Case Study 


As shown in Figure 2, HEGEL needs to map the mentions “Scotland”, “Murrayfield”, “Cuttitta” and 
“England” in the same document to their corresponding entities. “Murrayfield” and “Cuttitta” are not 
ambiguous, as each has only one candidate. However, “Scotland” and “England” are linked to the wrong 
candidates by the local model, whereas HEGEL outputs the right answers by correctly modeling the 
interactions among heterogeneous types of information, especially from the neighborhood around 
“Marcello Cuttitta” (a former rugby union player) and “Rugby Union”, and from the respective keywords 
related to “rugby”. The scores of the ablated models in Table 5 show that both the information from the 
keyword nodes (V_word) and the neighbor nodes and the correct handling of this information are important 
for HEGEL to capture the topical coherence and model the heterogeneous interactions. 


Table 5. Scores in case study. 


Models                      Scot.→country    Scot.→team    Eng.→football    Eng.→rugby 
Gold                        Low              High          Low              High 
HEGEL                       -0.162           -0.144        -0.147           -0.145 
- keyword nodes (V_word)    -0.336           -0.309        -0.312           -0.317 
- neighbor nodes            -0.176           -0.187        -0.170           -0.168 
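
Reading Table 5 amounts to comparing, for each mention, the scores of its two competing candidates and taking the larger one. A minimal sketch using the scores from the table follows. The dictionary layout and the labels `no_keyword_nodes` and `no_neighbor_nodes` are illustrative names, and the two ablated rows are assumed to be the variants without keyword nodes and without neighbor nodes, in that order.

```python
# Scores copied from Table 5: variant -> mention -> candidate -> score.
# Assumption: the two ablated rows are the variants without keyword
# nodes and without neighbor nodes, in that order.
scores = {
    "HEGEL": {
        "Scotland": {"country": -0.162, "team": -0.144},
        "England": {"football": -0.147, "rugby": -0.145},
    },
    "no_keyword_nodes": {
        "Scotland": {"country": -0.336, "team": -0.309},
        "England": {"football": -0.312, "rugby": -0.317},
    },
    "no_neighbor_nodes": {
        "Scotland": {"country": -0.176, "team": -0.187},
        "England": {"football": -0.170, "rugby": -0.168},
    },
}

# Gold candidates are the columns marked "High" in the table.
gold = {"Scotland": "team", "England": "rugby"}


def pick(candidate_scores):
    """Return the candidate with the highest score."""
    return max(candidate_scores, key=candidate_scores.get)


predictions = {
    variant: {mention: pick(cands) for mention, cands in mentions.items()}
    for variant, mentions in scores.items()
}

for variant, picked in predictions.items():
    for mention, candidate in picked.items():
        verdict = "correct" if candidate == gold[mention] else "wrong"
        print(f"{variant:18s} {mention:9s} -> {candidate:8s} ({verdict})")
```

Under these assumptions, only the full model picks the gold candidate for both mentions; each ablated variant gets one of the two mentions wrong.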


Integrating Manifold Knowledge for Global Entity Linking with Heterogeneous Graphs 


6. CONCLUSION AND FUTURE WORK 


In this paper, we presented HEGEL, a novel graph-based global entity linking method designed to model 
and utilize the interactions among heterogeneous types of information from different sources. We 
achieved this by constructing an informative document-level heterogeneous graph and applying a 
heterogeneous GNN to propagate and aggregate information on the graph, which is hard to achieve with 
previous homogeneous architectures. Extensive experiments on standard benchmarks show that HEGEL 
achieves state-of-the-art performance on the EL task. 